Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. At least, it is designed to be fast :-)

Utils

Dataset upload

DictReader is a Python class that maps the data it reads into dictionaries whose keys, unless specified otherwise, come from the first row of the CSV. All values in the subsequent rows become dictionary values and can be accessed with the respective dictionary key. However, these values are imported as strings. This method is not especially useful for data analysis tasks, but it is valuable in certain other cases (e.g., when working with the JSON format).
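A minimal sketch of DictReader in action (the column names and values here are made up for illustration):

```python
import csv
import io

# Inline CSV for the example; in practice this would be an open file handle.
raw = io.StringIO("id,name,price\n1,apple,0.5\n2,banana,0.25\n")

reader = csv.DictReader(raw)  # keys come from the header row by default
rows = list(reader)

print(rows[0]["name"])          # field access by column name
print(type(rows[0]["price"]))   # every value is read as str
```

Note that `rows[0]["price"]` is the string `"0.5"`, not a float: all type conversion is left to the caller.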

Numeric optimization

uint8 is the best choice if we do not have negative values and every value fits into the 0–255 range.

Objects optimizations

Datetime optimization

Index optimization

Filter optimization

When chaining multiple operations, it is worthwhile to think about which operation to execute first: filter steps should be executed as early as possible. Even when making inner joins between dataframes, it pays to filter before merging. If we sample our listings with .sample(frac=0.2) and then merge the reviews into them, we see that it's more efficient to filter the reviews first.
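A toy sketch of the idea with stand-in listings/reviews frames (the shapes and column names are invented): shrinking the reviews frame before the merge means the join machinery touches far fewer rows, while the result is identical.

```python
import pandas as pd

# Stand-ins for the listings/reviews frames mentioned above.
listings = pd.DataFrame({"listing_id": range(1000), "price": 100})
reviews = pd.DataFrame({"listing_id": [i % 1000 for i in range(10_000)],
                        "score": 5})

sampled = listings.sample(frac=0.2, random_state=0)

# Slower: merge the full reviews frame against the sample.
merged_all = sampled.merge(reviews, on="listing_id")

# Faster: filter reviews down first, then merge the smaller frame.
small_reviews = reviews[reviews["listing_id"].isin(sampled["listing_id"])]
merged_filtered = sampled.merge(small_reviews, on="listing_id")

assert len(merged_all) == len(merged_filtered)
```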

Vectorization

Vectorization is the process of executing operations on entire arrays. Similarly to NumPy, Pandas has built-in optimizations for vectorized operations. It is advised to avoid for loops when working with dataframes, since per-element read and write operations are costly. It is not always possible to vectorize, so I will also show what the best iterative options for Pandas are by comparing .iloc[], .iterrows(), .loc[] and .map()/.apply().
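A quick sketch of the difference (a made-up two-column frame): the loop performs one Python-level read per row, while the vectorized form is a single operation over whole columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10_000), "b": np.arange(10_000)})

# Loop version: one .iloc read per element per row -- slow.
out_loop = [df["a"].iloc[i] + df["b"].iloc[i] for i in range(len(df))]

# Vectorized version: one operation over the entire columns.
out_vec = df["a"] + df["b"]

assert (out_vec.values == out_loop).all()
```

Timing both (e.g. with %timeit) typically shows the vectorized version winning by orders of magnitude.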

What do the Pandas developers say?

This is the general order of precedence for performance of various operations:

1) vectorization
2) using a custom Cython routine
3) apply
   a) reductions that can be performed in Cython
   b) iteration in Python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one-row-at-a-time)

Using a custom Cython routine is usually too complicated, so let's skip that for now.

1) Vectorization is ALWAYS, ALWAYS the first and best choice. However, there is a small set of cases (usually involving a recurrence) which cannot be vectorized in obvious ways. Furthermore, on a smallish DataFrame, it may be faster to use other methods.

3) apply usually can be handled by an iterator in Cython space. This is handled internally by pandas, though it depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x)) will be executed pretty swiftly, though of course, df.sum(1) is even better. However something like df.apply(lambda x: x['b'] + 1) will be executed in Python space, and consequently is much slower.
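The contrast can be sketched on a small made-up frame; the column-wise reduction is dispatched efficiently by pandas, while the row-wise element access runs as a Python-level loop:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1000), "b": np.arange(1000)})

# Column-wise reduction: pandas dispatches np.sum per column -- quick.
col_sums = df.apply(lambda x: np.sum(x))

# The dedicated method is better still (here: row-wise sums).
row_sums = df.sum(axis=1)

# Row-wise apply with element access runs in Python space -- much slower.
slow = df.apply(lambda x: x["b"] + 1, axis=1)
```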

4) itertuples does not box the data into a Series. It just returns the data in the form of tuples.

5) iterrows DOES box the data into a Series. Unless you really need this, use another method.

6) Updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so updating one row at a time will always be very slow. It is much better to create new structures and concat.
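A sketch of the antipattern next to the recommended approach (toy data):

```python
import pandas as pd

rows = [{"x": i, "y": i * 2} for i in range(5)]

# Antipattern: grow a DataFrame one row at a time via .loc.
slow = pd.DataFrame(columns=["x", "y"])
for i, row in enumerate(rows):
    slow.loc[i] = [row["x"], row["y"]]  # index checks on every single write

# Better: build plain Python structures first, construct (or concat) once.
fast = pd.DataFrame(rows)
```

Both frames hold the same values, but the second path performs the indexing checks once instead of once per row.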

You can consider the above to be an “antipattern” in Pandas for several reasons. Firstly, it needs to initialize a list in which the outputs will be recorded.

Secondly, it uses the opaque object range(0, len(df)) to loop over, and then after applying apply_tariff(), it has to append the result to a list that is used to make the new DataFrame column. It also does what is called chained indexing with df.iloc[i]['date_time'], which often leads to unintended results.

.itertuples() yields a namedtuple for each row, with the row’s index value as the first element of the tuple. A namedtuple is a data structure from Python’s collections module that behaves like a Python tuple but has fields accessible by attribute lookup.

.iterrows() yields pairs (tuples) of (index, Series) for each row in the DataFrame.
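The two iterators side by side on a tiny made-up frame; note that .iterrows() boxes each row into a Series while .itertuples() does not:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# itertuples: plain namedtuples, fields accessible as attributes.
first = next(df.itertuples())
print(first.Index, first.a, first.b)   # 0 1 10

# iterrows: (index, Series) pairs -- each row is boxed into a Series.
idx, row = next(df.iterrows())
print(type(row).__name__, row["b"])    # Series 10
```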

Pandas’ .apply() method takes functions (callables) and applies them along an axis of a DataFrame (all rows, or all columns).

But how can you apply conditional calculations as vectorized operations in Pandas? One trick is to select and group parts of the DataFrame based on your conditions and then apply a vectorized operation to each selected group.

In this next example, you will see how to select rows with Pandas’ .isin() method and then apply the appropriate tariff in a vectorized operation.
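The original code is not reproduced here, so this is a sketch of how apply_tariff_isin() might look; the energy_kwh column name, the hour bands and the cent rates are illustrative assumptions:

```python
import pandas as pd

def apply_tariff_isin(df):
    """Assign a per-kWh rate to each hour band via boolean masks."""
    df["cost_cents"] = 0.0  # pre-create the output column

    peak = df.index.hour.isin(range(17, 24))
    shoulder = df.index.hour.isin(range(7, 17))
    off_peak = df.index.hour.isin(range(0, 7))

    # One vectorized assignment per group instead of a per-row loop.
    df.loc[peak, "cost_cents"] = df.loc[peak, "energy_kwh"] * 28
    df.loc[shoulder, "cost_cents"] = df.loc[shoulder, "energy_kwh"] * 20
    df.loc[off_peak, "cost_cents"] = df.loc[off_peak, "energy_kwh"] * 12

# Toy hourly frame covering one day.
idx = pd.date_range("2024-01-01", periods=24, freq="h")
df = pd.DataFrame({"energy_kwh": 1.0}, index=idx)
apply_tariff_isin(df)
```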

In apply_tariff_isin(), we are still admittedly doing some “manual work” by calling df.loc and df.index.hour.isin() three times each.

This is a fully vectorized way to get to your intended result, and it comes out on top in terms of timing.

NumPy’s digitize() function is similar to Pandas’ cut() in that the data will be binned, but this time the result is represented by an array of indexes indicating which bin each hour belongs to.
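A sketch of digitize() on an hour array (the bin edges and rates are the same illustrative bands as above, not prescribed values):

```python
import numpy as np

hours = np.array([0, 3, 7, 12, 17, 21, 23])

# Bin edges: off-peak < 7, shoulder < 17, peak < 24 (illustrative bands).
bins = np.array([7, 17, 24])
bin_ids = np.digitize(hours, bins)
print(bin_ids)  # [0 0 1 1 2 2 2]

# The bin indexes can then pick rates in one vectorized step.
rates = np.array([12, 20, 28])  # made-up cents per kWh
print(rates[bin_ids])           # [12 12 20 20 28 28 28]
```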

.eval() and .query()
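A short sketch of both methods on a toy frame; for large frames they can use the numexpr engine to avoid intermediate temporaries:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# .query() filters rows with an expression string.
big = df.query("a > 2 and b < 40")

# .eval() evaluates column arithmetic and can assign a new column.
df = df.eval("c = a + b")

print(big["a"].tolist())   # [3]
print(df["c"].tolist())    # [11, 22, 33, 44]
```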

Tricks (15 - 20 min)

Further reading

  1. https://realpython.com/pandas-settingwithcopywarning/
  2. https://spark.apache.org/docs/2.4.0/sql-pyspark-pandas-with-arrow.html
  3. https://www.oreilly.com/library/view/python-data-science/9781491912126/
  4. https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html